Given my interest in data science, I was very excited to read this book, and I was not disappointed. The book mainly discusses information that can be gleaned from web searches and how it differs from what people report in surveys and polls. That is a rather narrow topic, but the author manages to find some interesting tidbits in the data.
I am more interested in scientific applications of data science, but for people who are not interested in the subject, the book gives a nice overview of what data is really about. Here is an example: you want to find out about something, say how happy people are in their marriages. You send out surveys asking people how happy they are with their spouses. The people responding to the surveys can say whatever they want. Maybe they are miserable, but they want to project a positive image, so they say they are very happy. Maybe they see all of their friends on Facebook constantly posting about how wonderful their husbands are, so they say their husbands are wonderful too. The researcher receives the surveys and concludes that all marriages are wonderful.
In the meantime, the people who filled out the surveys are going onto Google and searching for "I hate my husband" or "how can I tell if my husband is cheating?" This turns out to be not too far from the actual case. For a variety of reasons, people say certain things even though those things may not be quite true. On Facebook, people tend to post idealized pictures of themselves and idealized versions of their lives. But when they go to Google, the searches they perform are more honest and revealing.
Google records every single search made (although the data is anonymized), and Mr. Stephens-Davidowitz has gone through those searches to attempt to draw some actual conclusions about people. For me, the results were much as expected, though less cynical people may be in for quite a shock. One example: after Obama was elected, searches for "n-word president" shot through the roof, and the places where those searches were concentrated voted heavily for Donald Trump. Search data seems to indicate that racism in the US is alive and well, and also that the election of Trump was largely driven by a racist backlash against the election of Obama.
The fact that most people do not describe themselves as racist may seem to contradict the search data, but in my opinion not many people think of themselves as racist. The people who really are racist are probably going to say "I am not racist, it's just a fact that other races are inferior to mine." Depending on people to self-report their thoughts, attitudes, and actions is a very unreliable way to gather information. The Dunning-Kruger effect describes how people who are experts in a subject tend to downplay their expertise, while those who are not experts tend to consider themselves far more knowledgeable than they really are. The experts know enough to know how much they don't know, while the non-experts think they know it all. This points to a general inability of people to evaluate themselves objectively, and this is where data comes in.
Data is a truly objective description of reality, but interpreting it is another matter. There is a saying, "if you torture numbers enough they will confess to almost anything," which means that it is easy to draw almost any conclusion from a large enough data set. Data science is the science of trying to find signals in data in an objective way, and that is something that is desperately needed in the world today, especially as experts are labelled "elitists" when they say things people don't want to hear.
As a data scientist, I enjoyed this book very much. However, it is written in such a way that you do not need to be a data scientist to understand it. I highly recommend this book.